This dataset contains information on a sample of 1,000 Uber rides in San Diego between 8pm-12am
Variables:
Calculate some descriptive statistics of ride price per mile in the data.
result <- explore(
uber,
vars = "price",
fun = c("mean", "sd", "min", "max"),
nr = 1
)
dtab(result) %>% render()Show a frequency table for fs (i.e., is it Friday/Saturday or not) and ccon (i.e., is Comic Con in town or not)
What do the correlations between the different variables look like.
Correlation
Data : uber
Method : Pearson
Variables : price, time, fs, ccon
Null hyp. : variables x and y are not correlated
Alt. hyp. : variables x and y are correlated
** Variables of type {factor} included without adjustment **
Correlation matrix:
price time fs
time -0.05
fs 0.65 0.03
ccon 0.42 -0.00 0.02
p.values:
price time fs
time 0.12
fs 0.00 0.40
ccon 0.00 0.95 0.61
time) of an Uber ride affect the price?Lets start by visualizing if there seems to be a link between the time and price variables in the data.
visualize(
uber,
xvar = "time",
yvar = "price",
type = "scatter",
nrobs = -1,
check = "line",
custom = FALSE
)Looking at the graph, the line seems to be downward slopping but the effect looks pretty minimal.
We will use linear regression (Model > Linear regression) to estimate the effect of time on price and generate a prediction plot.
\[ price = a + b \times time \]
Linear regression (OLS)
Data : uber
Response variable : price
Explanatory variables: time
Null hyp.: the effect of time on price is zero
Alt. hyp.: the effect of time on price is not zero
coefficient std.error t.value p.value
(Intercept) 2.3388 0.0215 108.9574 < .001 ***
time -0.0002 0.0002 -1.5364 0.1248
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.0024, Adjusted R-squared: 0.0014
F-statistic: 2.3605 df(1,998), p.value 0.1248
Nr obs: 1,000
Although the slope in this (prediction) plot looks more impressive this is a bit misleading due to the Y-axis scale (i.e., compare to the scatter plot). The regression coefficient for time is negative but not statistically significant (i.e., p-value > 0.05)
We will use linear regression (Model > Linear regression) to estimate the effect of fs on price, controlling for time, and generate a prediction plot.
Linear regression (OLS)
Data : uber
Response variable : price
Explanatory variables: time, fs
Null hyp.: the effect of x on price is zero
Alt. hyp.: the effect of x on price is not zero
coefficient std.error t.value p.value
(Intercept) 2.2070 0.0169 130.3290 < .001 ***
time -0.0003 0.0001 -2.7544 0.0060 **
fs|yes 0.4784 0.0175 27.3502 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.43, Adjusted R-squared: 0.4289
F-statistic: 376.0794 df(2,997), p.value < .001
Nr obs: 1,000
pred <- predict(
result,
pred_cmd = c("time = 0:240", "fs = c('no', 'yes')")
)
plot(pred, xvar = "time", color = "fs")The plot show a significant difference between prices charged on Friday and Saturday evening, compared to the other days of the week. The slope of the lines is, by assumption, the same in this model.
Now that we have included time and fs in the model time does seem to have a statistically significant downward effect.
coeff <- result$coeff$coefficient %>%
set_names(result$coeff$label)
visualize(
uber,
xvar = "time",
yvar = "price",
type = "scatter",
custom = TRUE
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"],
slope = coeff["time"],
color = "blue"
),
size = 1
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"] + coeff["fs|yes"],
slope = coeff["time"],
color = "red"
),
size = 1
) +
scale_color_manual(
name = "Friday/Saturday",
values = c(blue = "blue", red = "red"),
labels = c("no", "yes")
) +
theme(legend.position = "bottom")Another way to state the question is to ask if it is reasonable that the slope of time in the previous prediction plot is the same for rides on Friday and Saturday as it is on other days.
If we assume that Uber is frequently used in the (late) evening by users visiting bars downtown, this may assumption of “equal slopes” may not be reasonable.
We will use linear regression (Model > Linear regression) to estimate the effect of fs, time, and interaction between fs and time on price and generate a prediction plot.
Linear regression (OLS)
Data : uber
Response variable : price
Explanatory variables: time, fs
Null hyp.: the effect of x on price is zero
Alt. hyp.: the effect of x on price is not zero
coefficient std.error t.value p.value
(Intercept) 2.283 0.019 120.289 < .001 ***
time -0.001 0.000 -6.892 < .001 ***
fs|yes 0.243 0.034 7.130 < .001 ***
time:fs|yes 0.002 0.000 7.965 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.464, Adjusted R-squared: 0.463
F-statistic: 287.567 df(3,996), p.value < .001
Nr obs: 1,000
pred <- predict(
result,
pred_cmd = c("time = 0:240", "fs = c('no', 'yes')")
)
plot(pred, xvar = "time", color = "fs")Creating a scatter plot of the model results
coeff <- result$coeff$coefficient %>%
set_names(result$coeff$label)
visualize(
uber,
xvar = "time",
yvar = "price",
type = "scatter",
custom = TRUE
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"],
slope = coeff["time"],
color = "blue"
),
size = 1
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"] + coeff["fs|yes"],
slope = coeff["time"] + coeff["time:fs|yes"],
color = "red"
),
size = 1
) +
scale_color_manual(
name = "Friday/Saturday",
values = c(blue = "blue", red = "red"),
labels = c("no", "yes")
) +
theme(legend.position = "bottom")The prediction plot provides compelling evidence that the “equal slopes” assumption may not be reasonable. We draw a similar conclusion by looking at a scatter plot of price and time with color set to fs.
visualize(
uber,
xvar = "time",
yvar = "price",
type = "scatter",
nrobs = -1,
color = "fs",
check = "line",
custom = FALSE
)We will use linear regression (Model > Linear regression) to estimate the effect of time, fs, ccon on price and generate a prediction plot. To simplify the interpretation lets remove the interaction between time and fs.
Linear regression (OLS)
Data : uber
Response variable : price
Explanatory variables: time, fs, ccon
Null hyp.: the effect of x on price is zero
Alt. hyp.: the effect of x on price is not zero
coefficient std.error t.value p.value
(Intercept) 2.1486 0.0145 148.4938 < .001 ***
time -0.0003 0.0001 -3.2402 0.0012 **
fs|yes 0.4735 0.0147 32.3049 < .001 ***
ccon|yes 0.3800 0.0185 20.5937 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.6002, Adjusted R-squared: 0.599
F-statistic: 498.4851 df(3,996), p.value < .001
Nr obs: 1,000
pred <- predict(
result,
pred_cmd = c("time = 0:240", "ccon = c('no', 'yes')")
)
plot(pred, xvar = "time", color = "ccon")The prediction plot provides compelling evidence that rides are more expensive during Comic Con.
Another way to state the question is to ask if it is reasonable that the difference in price between Friday/Saturday and on other days is the same when Comic Con is (not) in town.
We will use linear regression (Model > Linear regression) to estimate the effect of time, fs, and ccon and interaction between fs and ccon on price and generate a prediction plot.
result <- regress(
uber,
rvar = "price",
evar = c("time", "fs", "ccon"),
int = "fs:ccon"
)
summary(result, dec = 4)Linear regression (OLS)
Data : uber
Response variable : price
Explanatory variables: time, fs, ccon
Null hyp.: the effect of x on price is zero
Alt. hyp.: the effect of x on price is not zero
coefficient std.error t.value p.value
(Intercept) 2.1583 0.0145 148.9736 < .001 ***
time -0.0003 0.0001 -3.3800 < .001 ***
fs|yes 0.4446 0.0158 28.0655 < .001 ***
ccon|yes 0.3241 0.0220 14.7255 < .001 ***
fs|yes:ccon|yes 0.1796 0.0395 4.5488 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.6084, Adjusted R-squared: 0.6068
F-statistic: 386.428 df(4,995), p.value < .001
Nr obs: 1,000
pred <- predict(
result,
pred_cmd = c("time = 0:240", "fs = c('no', 'yes')", "ccon = c('no', 'yes')")
)
plot(pred, xvar = "time", facet_col = "ccon", color = "fs")predict(
result,
pred_cmd = c("time = 0", "fs = c('no', 'yes')", "ccon = c('no', 'yes')"),
dec = 3
)coeff <- result$coeff$coefficient %>%
set_names(result$coeff$label)
visualize(
uber,
xvar = "time",
yvar = "price",
type = "scatter",
custom = TRUE
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"] + coeff["fs|yes"] + coeff["ccon|yes"],
slope = coeff["time"],
color = "blue"
),
size = 1
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"],
slope = coeff["time"],
color = "black"
),
size = 1
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"] + coeff["fs|yes"],
slope = coeff["time"],
color = "red"
),
size = 1
) +
geom_abline(
aes(
intercept = coeff["(Intercept)"] + coeff["ccon|yes"],
slope = coeff["time"],
color = "green"
),
size = 1
) +
scale_color_manual(
name = "Friday/Saturday",
values = c(blue = "blue", red = "red", black = "black", green = "green"),
labels = c("no fs + no ccon", "fs + ccon", "no fs + ccon", "fs + no ccon")
) +
theme(legend.position = "bottom")The previous model was missing the important interaction effect between time and fs. Lets now add that back in. This final model can be used to answer each of the previous question.
result <- regress(
uber,
rvar = "price",
evar = c("time", "fs", "ccon"),
int = c("time:fs", "fs:ccon")
)
summary(result)Linear regression (OLS)
Data : uber
Response variable : price
Explanatory variables: time, fs, ccon
Null hyp.: the effect of x on price is zero
Alt. hyp.: the effect of x on price is not zero
coefficient std.error t.value p.value
(Intercept) 2.230 0.016 139.502 < .001 ***
time -0.001 0.000 -8.098 < .001 ***
fs|yes 0.224 0.029 7.843 < .001 ***
ccon|yes 0.322 0.021 15.221 < .001 ***
time:fs|yes 0.002 0.000 9.103 < .001 ***
fs|yes:ccon|yes 0.174 0.038 4.585 < .001 ***
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
R-squared: 0.639, Adjusted R-squared: 0.637
F-statistic: 351.15 df(5,994), p.value < .001
Nr obs: 1,000
pred <- predict(
result,
pred_cmd = c("time = 0:240", "fs = c('no', 'yes')", "ccon = c('no', 'yes')")
)
plot(pred, xvar = "time", facet_col = "ccon", color = "fs")Finally, lets visualize the effects in a scatter plot. Looks a lot like the prediction plots from the last regression
visualize(
uber,
xvar = "time",
yvar = "price",
type = "scatter",
nrobs = -1,
facet_col = "ccon",
color = "fs",
check = "line",
custom = FALSE
)result <- nn(
uber,
rvar = "price",
evar = c("time", "fs"),
type = "regression",
size = 2,
decay = 0.2,
seed = 1234
)
summary(result, prn = TRUE)Neural Network
Activation function : Linear (regression)
Data : uber
Response variable : price
Explanatory variables: time, fs
Network size : 2
Parameter decay : 0.2
Seed : 1234
Network : 2-2-1 with 9 weights
Nr obs : 1,000
Weights :
b->h1 i1->h1 i2->h1
-1.25 0.92 0.87
b->h2 i1->h2 i2->h2
0.50 -1.13 2.12
b->o h1->o h2->o
-1.46 1.32 1.54
pred <- predict(result, pred_cmd = c("time = 0:240", "fs = c('yes', 'no')"))
plot(pred, xvar = "time", color = "fs")Neural Network
Data : uber
Response variable : price
Explanatory variables: time, fs
Prediction command : time = 0:240, fs = c('yes', 'no')
Rows shown : 10 of 482
time fs Prediction
0 yes 2.537
1 yes 2.538
2 yes 2.539
3 yes 2.539
4 yes 2.540
5 yes 2.541
6 yes 2.542
7 yes 2.543
8 yes 2.544
9 yes 2.544
visualize(
uber,
xvar = "time",
yvar = "price",
type = "scatter",
custom = TRUE
) +
geom_line(data = pred, aes(x = time, y = Prediction, color = fs), size = 1) +
labs(color = "Friday/Saturday") +
scale_color_manual(values=c("red", "blue")) +
theme(legend.position = "bottom")result <- nn(
uber,
rvar = "price",
evar = c("time", "fs", "ccon"),
type = "regression",
seed = 1234
)
summary(result, prn = TRUE)Neural Network
Activation function : Linear (regression)
Data : uber
Response variable : price
Explanatory variables: time, fs, ccon
Network size : 1
Parameter decay : 0.5
Seed : 1234
Network : 3-1-1 with 6 weights
Nr obs : 1,000
Weights :
b->h1 i1->h1 i2->h1 i3->h1
-1.51 -0.06 1.37 1.10
b->o h1->o
-0.72 2.40
pred <- predict(
result,
pred_cmd = c(
"time = 0:240", "fs = c('yes', 'no')",
"ccon = c('yes', 'no')"
)
)
plot(pred, xvar = "time", facet_col = "ccon", color = "fs")Neural Network
Data : uber
Response variable : price
Explanatory variables: time, fs, ccon
Prediction command : time = 0:240, fs = c('yes', 'no'), ccon = c('yes', 'no')
Rows shown : 10 of 964
time fs ccon Prediction
0 yes yes 3.004
1 yes yes 3.004
2 yes yes 3.004
3 yes yes 3.004
4 yes yes 3.004
5 yes yes 3.004
6 yes yes 3.003
7 yes yes 3.003
8 yes yes 3.003
9 yes yes 3.003
result <- nn(
uber,
rvar = "price",
evar = c("time", "fs", "ccon"),
type = "regression",
size = 2,
seed = 1234
)
summary(result, prn = TRUE)Neural Network
Activation function : Linear (regression)
Data : uber
Response variable : price
Explanatory variables: time, fs, ccon
Network size : 2
Parameter decay : 0.5
Seed : 1234
Network : 3-2-1 with 11 weights
Nr obs : 1,000
Weights :
b->h1 i1->h1 i2->h1 i3->h1
-1.40 0.59 0.69 1.39
b->h2 i1->h2 i2->h2 i3->h2
-0.53 1.04 -1.75 0.24
b->o h1->o h2->o
-0.12 2.02 -1.53
pred <- predict(
result,
pred_cmd = c(
"time = 0:240", "fs = c('yes', 'no')",
"ccon = c('yes', 'no')"
)
)
plot(pred, xvar = "time", facet_col = "ccon", color = "fs")Neural Network
Data : uber
Response variable : price
Explanatory variables: time, fs, ccon
Prediction command : time = 0:240, fs = c('yes', 'no'), ccon = c('yes', 'no')
Rows shown : 10 of 964
time fs ccon Prediction
0 yes yes 2.907
1 yes yes 2.908
2 yes yes 2.909
3 yes yes 2.910
4 yes yes 2.911
5 yes yes 2.912
6 yes yes 2.913
7 yes yes 2.914
8 yes yes 2.915
9 yes yes 2.917
\[ price = a + b \times time \]
\[ price = a + b \times time + c \times fs \]
\[ price = a + b \times time + c \times 0 = a + b \times time \]
\[ price = a + b \times time + c \times 1 = a + c + b \times time \]
\[ price = a + b \times time + c \times fs + d \times time \times fs \]
\[ a + b \times time + c \times 0 + d \times time \times 0 \\ = a + b \times time \]
\[ a + b \times time + c \times 1 + d \times time \times 1 \\ = a + c + (b + d) \times time \]
\[ price = a + b \times time + c \times fs + d \times ccon \]
\[ price = a + b \times time + c \times fs + d \times ccon + e \times fs \times ccon \]
Friday/Saturday, Not Comic-Con
\[ a + b \times time + c \times 1 + d \times 0 + e \times 1 \times 0 \\ = a + c + b \times time \]
Not Friday/Saturday, Not Comic-Con
\[ a + b \times time + c \times 0 + d \times 0 + e \times 0 \times 0 \\ = a + b \times time \]
Friday/Saturday, Comic-Con
\[ a + b \times time + c \times 1 + d \times 1 + e \times 1 \times 1 \\ = a + c + d + e + b \times time \]
Not Friday/Saturday, Comic-Con
\[ a + b \times time + c \times 0 + d \times 1 + e \times 0 \times 1 \\ = a + d + b \times time \]